% \subsubsection{Previous results on basic lexical features}
\subsubsection{Background}


In previous studies \cite{Lippincott:2011} biomedical subdomains were compared in terms of basic lexical features (verb, noun, adverb and adjective lemmas, part-of-speech tags, etc.) and using topic and selectional preference modeling methods.  Those results often contrast with the results of the current paper, so we briefly review them here for easier comparison.

It was found that subdomains formed stable clusters in terms of basic lexical behavior, and several recurrent clusters were identified, shown in Table \ref{clusters}.  The first cluster includes subdomains dealing primarily with microscopic processes and can be further subdivided into groupings of biochemical (\emph{Biochemistry}, \emph{Genetics}) and cellular (\emph{Cell Biology}, \emph{Embryology}) study.  The second cluster includes subdomains focused on specific anatomical systems (\emph{Endocrinology}, \emph{Pulmonary Medicine}).  The third cluster includes subdomains focused on clinical medicine (\emph{Psychiatry}) or specific patient types (\emph{Geriatrics}, \emph{Pediatrics}).  The fourth and final cluster includes subdomains focused on social and ethical aspects of medicine (\emph{Ethics}, \emph{Education}).

\begin{table*}[h]
%  \tiny
%  \begin{tabular}{|l|l|l|l|l|}
  \begin{tabular}{|p{0.2\textwidth}|p{0.2\textwidth}|p{0.2\textwidth}|p{0.2\textwidth}|p{0.2\textwidth}|}
    \hline
    \multicolumn{2}{|c|}{\emph{Microscopic}} & & & \\
\cline{1-2}
    \emph{Cellular}     & \emph{Biochemical}        & \emph{System-specific}    & \emph{Clinical}           & \emph{Social} \\
    \hline 
    Cell Biology & Biochemistry       & Endocrinology      & Geriatrics         & Ethics \\
    Virology     & Molecular Biology  & Rheumatology       & Pediatrics         & Education \\
    Microbiology & Genetics           & Pulmonary Medicine & Psychiatry         & \\
    Embryology   &                    &                    & Obstetrics         & \\
    \hline
  \end{tabular}
  \caption{Common subdomain clusters when considering lexical features.}
  \label{clusters}
\end{table*}

Almost all variation was significant at a high (\textgreater .99) level, supporting the intuition that lexical features such as vocabulary are primary aspects of different subdomains.  It was also noted that the handful of syntactic features considered, such as average sentence length and grammatical relation types, did not necessarily align with the more stable lexical clusters.  Verbs showed a mixture of the syntactic and lexical variation, reflecting their combined semantic and structural roles.

\subsubsection{Verb subcategorization behavior}
\label{subdomain_subcat_behavior}

Verbs differed widely in how much their subcategorization behavior varied between subdomains.  For example, the verb {\it induce} has a maximum JSD of .07 (between Botany and Physiology), while {\it develop} has a maximum of .62 (between Embryology and Therapeutics).  Similarly, some verbs shift behavior in just one or two subdomains (e.g.~{\it activate} in Molecular Biology and Biochemistry) while others are broadly heterogeneous (e.g.~{\it predict}).
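
For reference, the JSD values quoted here can be read against the standard definition of the Jensen-Shannon divergence.  The notation below is introduced only for illustration, and the logarithm base is an assumption.  Writing $P$ and $Q$ for the SCF distributions of a verb in two subdomains, and $M$ for their average,
\[
  \mathrm{JSD}(P \,\|\, Q) = \frac{1}{2} \sum_{f} P(f) \log_2 \frac{P(f)}{M(f)} + \frac{1}{2} \sum_{f} Q(f) \log_2 \frac{Q(f)}{M(f)},
  \qquad M(f) = \frac{1}{2}\bigl(P(f) + Q(f)\bigr),
\]
where $f$ ranges over SCFs and terms with zero probability contribute zero.  Under this assumption the value lies in $[0,1]$: it is $0$ when the two subdomains distribute the verb over frames identically, and $1$ when they share no frame at all.  If natural logarithms are used instead, the upper bound becomes $\ln 2 \approx .69$.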

In contrast to the lexical results, verb subcategorization tends to show small pockets of specialized behavior, and the distinction between microscopic, systemic, clinical and social subdomains is less consistent.  Instead, there are particular cases where a verb has taken on a specialized usage in a single subdomain.  The clearest example of this is {\it develop} (Figure \ref{develop:hm}), which has a distinct emphasis on intransitive usage INTRANS in Embryology (``The fetus develops''), compared to its typical transitive usage NP in other subdomains (``He developed a tumor'').  A similar example is the verb {\it express}, which takes NP-AS-NP-SC (``He expressed X as Y'') frequently in most subdomains, but not in Genetics and Cell Biology, where the simple transitive NP is unusually common.

Sometimes the reasons for specialized behavior are not so obvious: {\it perform} behaves differently in Medical Informatics and Education.  Both subdomains show unusually high usage of NP-PRED-RS, and Education is unique in its frequent use of TRANS.  

Not all verb behavior follows the pattern of extreme specialization in one or two subdomains: the heatmap for {\it predict} (Figure \ref{predict:hm}), for example, is extremely diverse.  The corresponding dendrogram (Figure \ref{predict:dend}) shows a clear distinction between system-specific and clinical subdomains in the top half, and the microscopic subdomains in the bottom half.  The top SCFs (Table \ref{predict:table}) show that the microscopic subdomains use {\it predict} in conjunction with infinitival forms (e.g.~NP-TOBE, ``We predicted it to be'').  {\it Recognize}, like {\it predict}, shows a diverse set of JSD values.  It is unclear why some subdomains prefer e.g.~THAT-S or NP-AS-NP, except perhaps that diagnosis-oriented subdomains prefer the latter.

Some verbs may have more than one specialized behavior: {\it treat} is
generally either used in a clinical sense (NP-FOR-NP, ``We
treat the patient for concussion'') or with raising
(NP-AS-NP-SC, ``We treat the infection as a separate issue'').
The most distinct subdomain, Public Health, appears as an outlier
because of its unique combination of both usages.  This is an example
of a heterogeneous subdomain merging SCF behaviors into a third,
unique distribution.



%Different subcategorization behavior is often visible only by considering the second or third most-frequent SCFs.  This suggests that many verbs retain much stable behavior between subdomains, and vary in a subset of instances.

%The clinical-oriented domains show regular variation in verb behavior.  This was also evident in previous studies of lexical and syntactic features, but with subcategorization it becomes more essential.

%``recognize''
%While most subdomains prefer SCF 50 (``We recognized the issue of utmost importance''), Pulmonary Medicine and Vascular Disease, which focus on medicine of certain systems, prefer SCF 158 (``We recognized that he underwent cardiac arrest'').
%``perform''\ref{perform:hm} in Education (``perform well'' vs. ``perform surgery'').
%``predict''


%\subsubsection{Information extraction}
%Specific changes in verb subcategorization can affect performance in the crucial area of information extraction.  For example, sentences of the form ``other factors X activates to suppress Y'' are far more common in subdomains like Genetics and Biochemistry, and have subdomain-specific interpretations of the argument structure.  An out-of-domain lexicon would fail to extract the full import of such a sentence.
